Some imports:
In [ ]:
    
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
try:
    import seaborn
except ImportError:
    pass
pd.options.display.max_rows = 10
    
By "group by" we are referring to a process involving one or more of the following steps
Similar to SQL GROUP BY
The example of the image in pandas syntax:
In [ ]:
    
df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})
df
    
In [ ]:
    
df.groupby('key').aggregate('sum')  # np.sum
    
In [ ]:
    
df.groupby('key').sum()
    
These exercises are based on the PyCon tutorial of Brandon Rhodes (so all credit to him!) and the datasets he prepared for that. You can download these data from here: titles.csv and cast.csv and put them in the /data folder.
In [ ]:
    
cast = pd.read_csv('data/cast.csv')
cast.head()
    
In [ ]:
    
titles = pd.read_csv('data/titles.csv')
titles.head()
    
In [ ]:
    
    
In [ ]:
    
    
In [ ]:
    
    
In [ ]:
    
    
In [ ]:
    
    
In [ ]:
    
    
Sometimes you don't want to aggregate the groups, but transform the values in each group. This can be achieved with transform:£
In [ ]:
    
df
    
In [ ]:
    
def normalize(group):
    return (group - group.mean()) / group.std()
    
In [ ]:
    
df.groupby('key').transform(normalize)
    
In [ ]:
    
    
In [ ]:
    
    
In [ ]:
    
    
A useful shortcut to calculate the number of occurences of certain values is value_counts (this is somewhat equivalent to df.groupby(key).size()))
For example, what are the most occuring movie titles?
In [ ]:
    
titles.title.value_counts().head()
    
In [ ]:
    
    
In [ ]:
    
    
In [ ]:
    
    
In [ ]:
    
    
In [ ]:
    
    
In [ ]:
    
    
In [ ]:
    
    
In [ ]:
    
    
In [ ]:
    
    
In [ ]: